This is part of my current paper about political reporting in German online news.
To measure the ideological content of several major online news services, I compare the topics discussed in these media with the press releases of the Bundestag parties using a structural topic model.
The following is an analysis of the content of the press releases scraped from the public websites of the political parties and parliamentary groups. Much of this analysis is inspired by the work of Julia Silge and David Robinson (Text Mining with R: A Tidy Approach).
I assume that parties use their press releases to promote their issues and positions and thereby also contribute to the election campaign. Note, however, that the press releases of the parties differ from those of the parliamentary groups (factions). Parties are financed by membership dues, donations and reimbursements of campaign expenses, while factions are financed by state funds. According to Parteiengesetz §25 (2), state-funded factions may not support parties from their funds, because parties that are not represented in the Bundestag would otherwise be put at a practical disadvantage.
Since it is difficult to draw the line between faction activity and election campaign assistance, I assume that factions shape the public perception of their party with their press releases, which is why I include the press releases of both the federal party and the federal faction.
The following row shows one press release (in the German original) before (title_text) and after (text_cleaned) preprocessing:
| | title_text | text_cleaned |
|---|---|---|
| 760 | Uwe Witt begrüßt Vorschlag der Rentenversicherung zur steuerfinanzierten Mütterrente . Januar 2018. Die Deutsche Rentenversicherung warnt die „GroKo-Sondierer“ CDU/CSU und SPD in ne neue Regierungskoalition müsse eine Finanzierung der Mütterrente regeln. Ein Ausbau dürfe nicht zu Lasten des Beitragszahlers gehen. „Alle Mehrausgaben, die der Rentenversicherung durch die Finanzierung zusätzlicher Mütterrenten für Geburten vor 1992 entstehen, müssen sach- und systemgerecht aus Steuermitteln finanziert werden“, heißt es in einem in der Nacht zum Dienstag verbreiteten Beschluss der Bundesvertreterversammlung.Der Bundestagsabgeordnete Uwe Witt, kommissarischer Sprecher des Arbeitskreises „Arbeit & Soziales der AfD-Bundestagsfraktion und Leiter des Bundesfachausschuss 11 (Soziale Sicherungssysteme und Rente, Arbeits- und Sozialpolitik) ist, hat das Nein der Alternative für Deutschland (AfD) zu einer Ausweitung der Mütterrente aus Beitragsmitteln erwartungsgemäß bekräftigt:„Da es sich bei den Mehrausgaben um beitragsfremde Leistungen handelt, sind diese, wie im AfD-Programm vorgesehen, aus Steuermitteln zu finanzieren.“, so Witt am Rande einer parteiinternen Veranstaltung in .Die von der CSU geforderte Ausweitung der Mütterrente soll laut Informationen der Rentenversicherung sieben Milliarden Euro kosten. Witt sagt: Es freut uns, dass die Deutsche Rentenversicherung uns, als der größten Oppositionspartei im Bundestag, zustimmt, dass die Ausgaben für die Erweiterung der Mütterrente aus Steuermitteln zu finanzieren sind.“ | uwe witt begrüßt vorschlag rentenversicherung steuerfinanzierten mütterrente deutsche rentenversicherung warnt groko sondierer ne regierungskoalition müsse finanzierung mütterrente regeln ausbau dürfe lasten beitragszahlers mehrausgaben rentenversicherung finanzierung zusätzlicher mütterrenten geburten entstehen sach systemgerecht steuermitteln finanziert heißt verbreiteten beschluss bundesvertreterversammlung uwe witt kommissarischer arbeitskreises arbeit soziales leiter bundesfachausschuss soziale sicherungssysteme rente arbeits sozialpolitik alternative deutschland ausweitung mütterrente beitragsmitteln erwartungsgemäß bekräftigt mehrausgaben beitragsfremde leistungen handelt programm vorgesehen steuermitteln finanzieren witt rande parteiinternen veranstaltung geforderte ausweitung mütterrente informationen rentenversicherung milliarden euro kosten witt freut deutsche rentenversicherung größten oppositionspartei bundestag zustimmt ausgaben erweiterung mütterrente steuermitteln finanzieren |
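The preprocessing that produces the cleaned text is not shown here. The following is only a minimal base R sketch of what such a step might look like, assuming lowercasing, removal of digits and punctuation, and a stop word filter. The function clean_text and the short stop word vector are hypothetical and not the lists actually used; the example row suggests that party names and some further terms were removed as well, which this sketch does not attempt to reproduce.

```r
# Hypothetical stop word list for illustration only; the real list is not shown in the text
stop_words_de <- c("der", "die", "das", "zur", "und", "in", "zu", "aus", "januar")

# Hypothetical helper: lowercase, keep letters only, drop stop words
clean_text <- function(text, stop_words) {
  text <- tolower(text)
  # Replace every run of non-letter characters (digits, punctuation) with one space
  text <- gsub("[^\\p{L}]+", " ", text, perl = TRUE)
  words <- strsplit(trimws(text), " +")[[1]]
  paste(words[!words %in% stop_words], collapse = " ")
}

raw <- "Uwe Witt begrüßt Vorschlag der Rentenversicherung zur steuerfinanzierten Mütterrente . Januar 2018."
cleaned <- clean_text(raw, stop_words_de)
print(cleaned)
```

The printed result can be compared with the beginning of the cleaned column above. Handling of umlauts depends on running R in a UTF-8 locale.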
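library(dplyr)
library(tidyr)
library(ggplot2)
library(tidytext)
library(scales)

# One row per word; pressReleases is assumed to hold one press release per row
tokens <- pressReleases %>%
  unnest_tokens(word, text_cleaned1)

# Word counts per party plus tf, idf and tf-idf, treating each party as one document
tokens.count <- tokens %>%
  count(party, word, sort = TRUE) %>%
  ungroup() %>%
  bind_tf_idf(word, party, n)

# The 15 most frequent words per party. top_n() without `wt` ranks by the
# last column (tf_idf), so rank by tf explicitly.
tokens.count %>%
  arrange(desc(tf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(party) %>%
  top_n(15, tf) %>%
  ungroup() %>%
  ggplot(aes(word, tf)) +
  geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
  labs(x = NULL, y = "Term Frequency") +
  facet_wrap(~party, ncol = 3, scales = "free") +
  coord_flip()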
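The following plots compare the word frequencies of the different parties. Each panel plots the word proportions of one party against those of a reference party.
An empty space at low frequencies indicates less similarity between the two parties.
The closer the words in a panel lie to the dashed diagonal (slope 1, intercept 0), the more similar the vocabulary of the two parties.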
# Share of each word in a party's total word count, one column per party
frequency <- tokens.count %>%
  group_by(party) %>%
  mutate(proportion = n / sum(n)) %>%
  select(party, word, proportion) %>%
  spread(party, proportion)
# Word proportions of every other party plotted against the CDU
frequency %>%
  gather(party, proportion, -word, -CDU) %>%
  ggplot(aes(x = proportion, y = `CDU`, color = abs(`CDU` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position = "none") +
  labs(y = "CDU", x = NULL)
#ggsave("../figs/word_freq_CDU.png", width = 15, height = 10)
# Word proportions of every other party plotted against the SPD
frequency %>%
  gather(party, proportion, -word, -SPD) %>%
  ggplot(aes(x = proportion, y = `SPD`, color = abs(`SPD` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position = "none") +
  labs(y = "SPD", x = NULL)
#ggsave("../figs/word_freq_SPD.png", width = 15, height = 10)
# Word proportions of every other party plotted against the FDP
frequency %>%
  gather(party, proportion, -word, -FDP) %>%
  ggplot(aes(x = proportion, y = `FDP`, color = abs(`FDP` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position = "none") +
  labs(y = "FDP", x = NULL)
#ggsave("../figs/word_freq_FDP.png", width = 15, height = 10)
# Word proportions of every other party plotted against B90/GRÜNE
frequency %>%
  gather(party, proportion, -word, -`B90/GRÜNE`) %>%
  ggplot(aes(x = proportion, y = `B90/GRÜNE`, color = abs(`B90/GRÜNE` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position = "none") +
  labs(y = "B90/GRÜNE", x = NULL)
#ggsave("../figs/word_freq_GRUENE.png", width = 15, height = 10)
# Word proportions of every other party plotted against DIE LINKE
frequency %>%
  gather(party, proportion, -word, -`DIE LINKE`) %>%
  ggplot(aes(x = proportion, y = `DIE LINKE`, color = abs(`DIE LINKE` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position = "none") +
  labs(y = "DIE LINKE", x = NULL)
#ggsave("../figs/word_freq_LINKE.png", width = 15, height = 10)
# Word proportions of every other party plotted against the AfD
frequency %>%
  gather(party, proportion, -word, -`AfD`) %>%
  ggplot(aes(x = proportion, y = `AfD`, color = abs(`AfD` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~party, nrow = 2) +
  theme(legend.position = "none") +
  labs(y = "AfD", x = NULL)
#ggsave("../figs/word_freq_AfD.png", width = 15, height = 10)
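The visual comparison in these panels could be complemented by a single number per pair of parties. The following is a minimal sketch, not part of the original analysis, that computes the Pearson correlation of word proportions between two parties. The data frame frequency_toy and the helper similarity are hypothetical, and the proportions are made-up numbers with the same wide layout as the frequency table used for the plots.

```r
# Toy word proportions per party (made-up numbers, not taken from the corpus)
frequency_toy <- data.frame(
  word = c("deutschland", "eu", "menschen", "rente", "steuer"),
  CDU  = c(0.012, 0.006, 0.005, 0.002, NA),
  SPD  = c(0.011, 0.005, 0.006, 0.003, 0.001),
  AfD  = c(0.013, 0.006, 0.001, NA, 0.004)
)

# Hypothetical helper: Pearson correlation of the proportions of two parties,
# using only the words that occur in both (rows with NA are dropped)
similarity <- function(freq, party_a, party_b) {
  cor(freq[[party_a]], freq[[party_b]], use = "complete.obs")
}

cdu_spd <- similarity(frequency_toy, "CDU", "SPD")
cdu_afd <- similarity(frequency_toy, "CDU", "AfD")
print(c(CDU_SPD = cdu_spd, CDU_AfD = cdu_afd))
```

Because words that one of the two parties never uses are dropped, such a correlation ignores exactly the empty regions that the plots make visible, so it should be read alongside the panels rather than instead of them.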
The statistic tf-idf (term frequency - inverse document frequency) is intended to measure how important a word is to a document in a collection (or corpus) of documents. In this case we measure how important a word is to a party (within all the press releases of that party) in the collection of all parties (and their press releases).
The inverse document frequency for any given term is defined as
\[ idf(\text{term}) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right) \]
In this case, \(n_{\text{documents}} = 6\) as we have 6 different parties.
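To make the computation concrete, the following base R sketch calculates tf, idf and tf-idf by hand for a toy table. The data frame counts and its numbers are made up for illustration, and the sketch assumes the natural logarithm of the document ratio, which is what bind_tf_idf from tidytext uses.

```r
# Toy counts: how often each party uses each word (made-up numbers)
counts <- data.frame(
  party = c("A", "A", "B", "B", "C"),
  word  = c("eu", "rente", "eu", "steuer", "eu"),
  n     = c(3, 1, 2, 2, 4)
)

# Each party is one "document", so there are 3 documents here
n_documents <- length(unique(counts$party))

# tf: share of a party's tokens that a word accounts for
counts$tf <- counts$n / ave(counts$n, counts$party, FUN = sum)

# idf: natural log of (number of parties / number of parties using the word)
docs_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_documents / docs_with_word)

counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

In this toy table the word "eu" is used by all three parties, so its idf and tf-idf are 0 regardless of how often it occurs, while "rente" and "steuer" are each used by a single party and receive the maximum idf of log(3).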
Terms with low tf-idf:
tokens.count %>%
  arrange(tf_idf)
## # A tibble: 55,124 x 6
## party word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 DIE LINKE bundesregierung 566 0.0109 0 0
## 2 AfD deutschland 470 0.0128 0 0
## 3 DIE LINKE deutschland 330 0.00637 0 0
## 4 DIE LINKE eu 324 0.00625 0 0
## 5 DIE LINKE menschen 286 0.00552 0 0
## 6 AfD deutschen 235 0.00641 0 0
## 7 FDP deutschland 226 0.0109 0 0
## 8 DIE LINKE endlich 219 0.00423 0 0
## 9 AfD eu 208 0.00567 0 0
## 10 DIE LINKE vorsitzend 208 0.00401 0 0
## # ... with 55,114 more rows
An idf (and thus tf-idf) of 0 indicates that these terms appear in the press releases of all six parties.
The inverse document frequency (and thus tf-idf) is very low (here 0) for terms that occur in many (here all) of the documents (all press releases of one party) in the collection (the press releases of all parties).
Terms with high tf-idf:
tokens.count %>%
  arrange(desc(tf_idf))
## # A tibble: 55,124 x 6
## party word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 AfD weidel 185 0.00505 1.79 0.00904
## 2 FDP beer 59 0.00286 1.79 0.00512
## 3 FDP nicola 58 0.00281 1.79 0.00503
## 4 AfD pazderski 98 0.00267 1.79 0.00479
## 5 AfD alic 145 0.00395 1.10 0.00434
## 6 FDP lambsdorff 42 0.00203 1.79 0.00364
## 7 DIE LINKE dagdelen 102 0.00197 1.79 0.00353
## 8 FDP präsidiumsmitgli 57 0.00276 1.10 0.00303
## 9 AfD brandner 62 0.00169 1.79 0.00303
## 10 FDP generalsekretärin 54 0.00261 1.10 0.00287
## # ... with 55,114 more rows
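The idf values in the two outputs can be traced back to the number of parties that use a term. A short worked example, assuming the natural logarithm that bind_tf_idf uses and the six parties as documents:

\[ \ln\frac{6}{6} = 0, \qquad \ln\frac{6}{2} \approx 1.10, \qquad \ln\frac{6}{1} \approx 1.79 \]

A term with an idf of 1.79, such as "weidel", therefore occurs in the press releases of only one party, while a term with an idf of 1.10, such as "alic", occurs in those of two parties. Multiplying by the term frequency gives the tf-idf reported in the first row:

\[ \text{tf-idf}(\text{weidel}, \text{AfD}) = 0.00505 \times 1.79 \approx 0.00904 \]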
# The 15 words with the highest tf-idf per party (rank on tf_idf explicitly)
tokens.count %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(party) %>%
  top_n(15, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf)) +
  geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~party, ncol = 3, scales = "free") +
  coord_flip()
#ggsave("../figs/tf-idf.png", width = 11, height = 6)
Words can be considered not only as single units, but also in their relationship to each other. N-grams, for example, help to investigate which words tend to follow others immediately. To do this, we tokenize the text into sequences of n consecutive words called n-grams (bigrams for n = 2, trigrams for n = 3). By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.
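As a small illustration of what this tokenization produces, the following base R sketch builds bigrams and trigrams from a single toy sentence. The helper make_ngrams is hypothetical and only illustrates the idea; in the analysis itself, unnest_tokens with token = "ngrams" does this for the whole data frame.

```r
# Toy sentence, already lowercased and stripped of punctuation
text <- "die ausweitung der mütterrente aus steuermitteln finanzieren"
words <- strsplit(text, " ", fixed = TRUE)[[1]]

# Hypothetical helper: paste each word together with the n - 1 words that follow it
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

bigrams_toy  <- make_ngrams(words, 2)
trigrams_toy <- make_ngrams(words, 3)
print(bigrams_toy)
print(trigrams_toy)
```

A text of k words yields k - n + 1 overlapping n-grams, so the seven-word sentence above gives six bigrams and five trigrams.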
# One row per bigram (pair of consecutive words)
bigrams <- pressReleases %>%
  unnest_tokens(bigram, text_cleaned1, token = "ngrams", n = 2)

# tf-idf of each bigram, again treating each party as one document
bigrams.count <- bigrams %>%
  count(party, bigram, sort = TRUE) %>%
  ungroup() %>%
  bind_tf_idf(bigram, party, n)

# The 15 bigrams with the highest tf-idf per party. Ranking on tf_idf explicitly
# avoids top_n() falling back to whatever column happens to be last.
bigrams.count %>%
  group_by(party) %>%
  top_n(15, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(reorder(bigram, tf_idf), tf_idf)) +
  geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~party, ncol = 3, scales = "free") +
  coord_flip()
#ggsave("../figs/tf-idf_bigram.png", width = 11, height = 6)
# One row per trigram (three consecutive words)
trigrams <- pressReleases %>%
  unnest_tokens(trigram, text_cleaned1, token = "ngrams", n = 3)

# tf-idf of each trigram, again treating each party as one document
trigrams.count <- trigrams %>%
  count(party, trigram, sort = TRUE) %>%
  ungroup() %>%
  bind_tf_idf(trigram, party, n)

# The 15 trigrams with the highest tf-idf per party. Ranking on tf_idf explicitly
# avoids top_n() falling back to whatever column happens to be last.
trigrams.count %>%
  group_by(party) %>%
  top_n(15, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(reorder(trigram, tf_idf), tf_idf)) +
  geom_col(show.legend = FALSE, fill = "darkslategray4", alpha = 0.9) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~party, ncol = 3, scales = "free") +
  coord_flip()
#ggsave("../figs/tf-idf_trigram.png", width = 12, height = 6)